In all but special circumstances, measurements of time-dependent
processes reflect internal structures and correlations only indirectly.
Building predictive models of such hidden information sources requires
discovering, in some way, the internal states and mechanisms.
Unfortunately, there are often many possible models that are
observationally equivalent. Here we show that the situation is not as
arbitrary as one would think. We show that generators of hidden
stochastic processes can be reduced to a minimal form and compare this
reduced representation to that provided by computational mechanics: the
ε-machine. On the way to developing deeper, measure-theoretic
foundations for the latter, we introduce a new two-step reduction
process. The first step (internal-event reduction) produces the smallest
observationally equivalent σ-algebra and the second (internal-state
reduction) removes σ-algebra components that are redundant for optimal
prediction. For several classes of stochastic dynamical systems these
reductions produce representations that are equivalent to ε-machines.

I. INTRODUCTION

Experiment and simulation often produce voluminous 
amounts of data — data that the scientist or analyst at- 
tempts to understand by building predictive models. The 
best models, however, do more than simply predict the 
data. In the best of circumstances, models also capture
the internal structures, active degrees of freedom, correlations,
and so on that underlie the observations. In this
way modeling enhances understanding and leads to new 
insights about the forces that shape our world. 

Unfortunately, measurements generally are only indi- 
rect indicators of internal structure. This makes the 
process of model building difficult and often highly 
nonunique. One would hope that there is some princi- 
pled approach to model building and inference that would 
guide us in inferring structural properties from data. The 
possibilities for such an approach are bounded by two ex- 
tremes: (i) Are there formal constraints that guide the 
discovery of good representations? (ii) Can the observa- 
tions themselves tell us which representation to use or, 
perhaps, how to correct an initially faulty hypothesis? 

These days, however, the problem of building useful
predictors of hidden information sources is compounded
by the fact that the systems studied are quite complicated,
for example in the sense of consisting of many components.
Genomic, geophysical, neurobiological, Internet traffic,
and World Wide Web systems easily come to mind as complex
in this sense and as particularly desirable to model.
This very practical observation, in turn,
argues even more forcefully for a principled approach to 
discovering and describing hidden structure. That is, we 
now need to understand the process of model building for 
such complicated systems well enough to teach machines 
how to do it. 

Here, building on previous work, we address
one piece in this puzzle, which we call the Forward
Modeling Problem: Given a generator of an observed
stochastic process, is there a minimal, optimal predictor of it?
In answering this question positively, we have two goals. 
The first, naturally enough, is to articulate the notion 
of minimal generators of observed stochastic processes
and show that they exist. The second, though, is to lay
more rigorous and broader foundations than currently
available for the Reverse Modeling Problem: Given
observations, can one reconstruct the hidden mechanisms?



A. Background 

Reviewing a little background on previous work will 
help put the current formal results in perspective and 
motivate our development. We then comment on closely 
related work in which similar questions arise, but which 
take different approaches to structural inference. Then, 
after outlining our approach, the mathematical develop- 
ment begins. 

If we are to build a predictive model of an informa- 
tion source that produces a time series, the most basic 
assumption to make is that the source, at each moment
of time, is in some "state". Over time, the source
transitions from state to state. As we noted already, though,
in the general setting we do not have access to these
states; we only have indirect information about them,
information that we call measurements. So the modeling
question reduces to the following. Given that all we
have are sequences of observations, what kind of "state" 
should be formed from them and used for modeling? The 
answer is rather straightforward, and seemingly tautolog- 
ical: the "states" that we should use are those that are 
effective for prediction. 

This is the starting point for how computational me- 
chanics Q, 1^ Q builds optimal models. One of the no- 
table results in computational mechanics, though, is that 
the representation, which emerges from focusing on states 
that are effective for prediction, captures all of a pro- 
cess's internal causal structure. In fact, computational 
mechanics shows that there is a preferred representation 
for modeling, which is called an ε-machine.

To start, we consider a time series of observations
$\overleftrightarrow{s} = \ldots, s_{-2}, s_{-1}, s_0, s_1, \ldots$,
in which the individual measurements are symbols in a finite
alphabet: $s_t \in \mathcal{A}$. An ε-machine consists of states,
called causal states and denoted $\mathcal{S}$, and transitions
between them. The causal states are defined as those sets of
histories $\overleftarrow{s}_t = \ldots, s_{t-3}, s_{t-2}, s_{t-1}$
that are equivalent for predicting the future
$\overrightarrow{s}_t = s_t, s_{t+1}, s_{t+2}, \ldots$. That is, two
histories, $\overleftarrow{s}$ and $\overleftarrow{s}'$, are
associated with a given causal state when the sets of possible
futures "look" the same having seen them. More precisely, this
modeling principle defines an equivalence relation $\sim$ over the
set $\overleftarrow{S}$ of histories:

$$\overleftarrow{s} \sim \overleftarrow{s}' \quad \text{if and only if} \quad P(\overrightarrow{s} \mid \overleftarrow{s}) = P(\overrightarrow{s} \mid \overleftarrow{s}') , \qquad (1)$$

where in the conditional distribution equality we mean
that each individual future is given the same probability.
The resulting equivalence classes are the causal states.
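
The equivalence relation of Eq. (1) can be illustrated on a finite toy problem. The following is a minimal sketch, not the reconstruction procedure of computational mechanics itself: it assumes that the conditional distributions over futures are already known for finitely many histories, and the function name `causal_states` and the toy data are hypothetical.

```python
from collections import defaultdict

def causal_states(future_given_history, ndigits=12):
    """Group histories whose conditional future distributions coincide (Eq. (1))."""
    classes = defaultdict(list)
    for history, dist in future_given_history.items():
        # Canonical, hashable form of P(future | history); rounding guards
        # against floating-point noise when comparing probabilities.
        key = tuple(sorted((fut, round(p, ndigits))
                           for fut, p in dist.items() if p > 0))
        classes[key].append(history)
    return sorted(sorted(members) for members in classes.values())

# Hypothetical toy data: histories of length 2, futures of length 1.
toy = {
    "00": {"0": 0.5, "1": 0.5},
    "10": {"0": 0.5, "1": 0.5},
    "01": {"1": 1.0},
    "11": {"0": 0.5, "1": 0.5},
}
states = causal_states(toy)
print(states)
```

In this toy data set three histories predict the same future and therefore collapse into one class, while the history "01" forms a class of its own.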

From this, one can show that the ε-machine for an
information source is the optimal, minimal, and unique
predictor of that source. In the language of
mathematical statistics, the ε-machine is a minimal
sufficient statistic for the observed stochastic process
produced by an information source. More than being a good
predictor that is small, the semigroup determined by the
causal states and transitions captures all of the
information source's internal structure: regularities,
symmetries, and so on. And, due to minimality, one can show
that the statistical complexity $C_\mu$ (the "size" of an
ε-machine measured as the Shannon entropy of the set of
causal states) measures the amount of historical
information that the source stores. That is, ε-machine
minimality is not only helpful in terms of compact
representations, but it is essential, since it gives one a
quantitative way to say how structured a hidden information
source is.
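
As a small numerical illustration of the statistical complexity, the sketch below computes the Shannon entropy of a distribution over causal states. It assumes that this distribution is given; the function name `statistical_complexity` and the example distributions are hypothetical.

```python
import math

def statistical_complexity(state_probs):
    """Shannon entropy, in bits, of a distribution over causal states."""
    return -sum(p * math.log2(p) for p in state_probs if p > 0)

# Two equally likely causal states store one bit of historical information;
# a single causal state stores none.
c_two = statistical_complexity([0.5, 0.5])
c_one = statistical_complexity([1.0])
print(c_two, c_one)
```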

Although the emphasis here is on the mathematical 
foundations of computational mechanics, we should note 
that it has been used to analyze structural complexity 
in a wide range of information sources. These include 
cellular automata, one-dimensional maps, and
the one-dimensional Ising model, as well as several
experimental systems, such as the dripping faucet,
atmospheric turbulence, geomagnetic data, complex
materials, and molecular dynamics.

In the present work we begin to address the problems 
posed in Appendix II.3 of Ref. ^ on founding compu- 
tational mechanics more fully on stochastic process and 
measure theories by considering one part of the Forward 
Modeling Problem noted above. The results here differ 
from previous work on computational mechanics in two 
ways. First, the development is mathematically rigor- 
ous, in the sense that we use measure theory to explore 
the notion of minimal representations, which underlies 
ε-machines. What is novel compared to stochastic process
theory is that we ask for minimal representations
of a stochastic process and express them in terms of the
minimal σ-algebra. We also introduce two new components
of the minimization procedure, internal-event and
internal-state reduction, which complement the existing
concept of causal-state reduction for ε-machines.
Analyzing the Forward Modeling Problem in this way allows
us to draw parallels with the computational mechanics
development of ε-machines, comparing and contrasting
the various kinds of reduction method. We show that
in a number of cases these reductions are equivalent and
so provide an extension of the original concept of an
ε-machine to a broader class of processes than previously
possible.



B. Related Work 

The modeling questions that we address here, and that 
are also addressed by computational mechanics, do not 
arise in a vacuum. Here we briefly mention related work 
that is motivated by similar concerns of the equivalence of 
observed processes and of structural inference, but that
adopts different approaches. In a later section, when we
turn to discuss our results, we broaden the discussion of 
related work to mention additional areas in which one 
might find useful applications. 

One of the first attempts to address the difficulties of
analyzing (known) hidden information sources is that of
Ref. [15]. The problem, which comes under the heading of
the identifiability of functions of Markov chains, was to
calculate the source entropy rate, given an internal
finite-state Markov chain, the states of which are observed with
a probabilistic measurement function. (Note that today
one refers to this class of information sources as hidden
Markov models.) There it was shown that in the
majority of cases there are no closed-form expressions
for the entropy rate. A corollary of this result is that one
needs to determine the effective states (and these might
be infinite in number) in order to calculate a property as
basic as the entropy rate, that is, simply to determine
how random a finite-state information source is. This
contrasts, of course, with Shannon's closed-form
expression for finite Markovian sources. We take Ref.
[15]'s result as one of the first indications of the nontrivial
nature of inferring the structure of hidden information
sources. Another testimony to this difficulty is that the
problem of identifiability itself, though posed by
Blackwell and Koopmans in the late 1950s, was not solved for
almost 40 years. Moreover, the existence of minimal
representations of these same hidden sources was not
established until a few years later still.

Similar concerns about inference, representation, and
causality are found in the fields of causal inference, [22]
graphical models, [23] and nonlinear time series analysis
and state-space reconstruction. [24] Most of the work in
these areas proceeds by assuming a given set of observed 
and hidden variables (and their connectivity) and then 
asks for efficient algorithms to estimate various kinds of 
marginal, conditional, and joint distributions. The goals 
are to infer from the latter the relationship between these 
variables and so, on that basis, to draw structural con- 
clusions. That is, in these cases one begins with strong 
structural priors about the internal architecture of a hid- 
den information source in order to initiate analysis. No- 
tably, only the last of these fields concentrates on tempo- 
ral dynamics and sources with memory. Here we are in- 
terested in both architectural and temporal properties of 
memoryful hidden sources and wish to understand these 
employing a minimum of structural priors. 



C. Outline 

The principal focus of the following is to develop the
notion of a minimal reduction of a given (hidden) Markov
process. To do this, the development is organized as
follows. In the next section we characterize the (rather
general) class of stochastic processes, the hidden information
sources, in a way that respects the distinction between
a process's internal structure and the measurements,
which indirectly reflect the internal state, available to
an observer. This then allows us to define generators
of stochastic processes as Markov transition kernels, and
so state the problem of observationally equivalent
generators. The succeeding section establishes how different
generators can be mapped onto each other while
maintaining observational equivalence. Then, in the next
section, we address the central problem and show that one
can maximally reduce the representation of a process's
internal structure, its generator, while still producing
the same observed stochastic process. The reduction is
achieved in two steps: the first, called internal-event
reduction, produces the smallest σ-algebra and the second,
internal-state reduction, reduces the internal structure
further, removing components that are not necessary for
optimal prediction. During the development we illustrate
the ideas with several examples that show how the new
formulation extends the range of applicability of
computational mechanics.



II. GENERATORS OF STOCHASTIC 
PROCESSES 

An information source is a process that at each time 
step emits an output or measurement symbol. Only the 
probabilistic nature of the output process is specified in 
order to describe the observed information processing. 
Indeed, often in information theory a source is mathemat- 
ically described as a stochastic process without concrete 
specification of internal mechanisms. In many theories 
of complexity, however, one often uses explicitly struc- 
tural notions (e.g., automata) from the theory of discrete 
computation [25] to describe the resources required to
reproduce or model an observed process. So that we will
have a mathematical model that both captures the ob- 
served stochastic process and allows for a range of inter- 
nal structures, we adapt the concept of finite-state au- 
tomata to the setting of stochastic processes as follows; 
cf. Refs. I23and[l3. 

We consider a finite set Q of internal states of the sys- 
tem and also a finite set A of output states, which are the 
observed symbols. The internal structure is modeled in 
various ways. First, it can be specified by a deterministic 
(det) transition map: 

$$T_{\mathrm{det}} : Q \to Q \times \mathsf{A}, \qquad x \mapsto T_{\mathrm{det}}(x) = (y, s) . \qquad (2)$$

This map assigns to each internal state $x \in Q$ the next
internal state $y$ and, at the same time, also the next
output symbol $s \in \mathsf{A}$. Figure 1 illustrates the
transition structure.

A nondeterministic (non) version of such a machine 
(without input) can be introduced as a map 

$$T_{\mathrm{non}} : Q \to 2^{Q \times \mathsf{A}}, \qquad x \mapsto C . \qquad (3)$$

This machine assigns to each internal state $x$ a set $C$
of possible next pairs $(y, s)$. This extends the
deterministic machine $T_{\mathrm{det}}$ of Eq. (2), which can be
interpreted within the nondeterministic framework as follows:

$$T_{\mathrm{non}}(x) := \{T_{\mathrm{det}}(x)\} \in 2^{Q \times \mathsf{A}} .$$

FIG. 1: The transition structure of a deterministic machine
that generates a stochastic output process.

FIG. 2: The Markov transition kernel. The internal structure
is specified by $(Q, \mathcal{Q})$ and the observed process by
$(\mathsf{A}, \mathcal{D})$.



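Finally, a further extension is provided by the following
probabilistic (pr) interpretation of a nondeterministic
machine $T_{\mathrm{non}}$:

$$T_{\mathrm{pr}} : Q \times 2^{Q \times \mathsf{A}} \to [0, 1] ,$$

where

$$T_{\mathrm{pr}}(x, C) := \frac{|C \cap T_{\mathrm{non}}(x)|}{|T_{\mathrm{non}}(x)|} .$$

The function $T_{\mathrm{pr}}$ satisfies

$$T_{\mathrm{pr}}(x, C_1 \uplus C_2) = T_{\mathrm{pr}}(x, C_1) + T_{\mathrm{pr}}(x, C_2)$$

and

$$T_{\mathrm{pr}}(x, Q \times \mathsf{A}) = 1 .$$

Therefore $T_{\mathrm{pr}}$ is a Markov transition kernel on
finite symbols.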
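
The probabilistic interpretation can be checked on a small finite machine. The sketch below is only an illustration under stated assumptions: the two-state machine `t_non`, its alphabet, and the helper name `make_t_pr` are hypothetical, and exact rational arithmetic is used so that additivity and normalization can be tested by equality.

```python
from fractions import Fraction
from itertools import product

def make_t_pr(t_non):
    """Return T_pr(x, C) = |C ∩ T_non(x)| / |T_non(x)| for a finite machine."""
    def t_pr(x, c):
        successors = t_non[x]
        return Fraction(len(set(c) & successors), len(successors))
    return t_pr

# Hypothetical two-state machine over the output alphabet {"a", "b"}.
states, alphabet = ["u", "v"], ["a", "b"]
t_non = {
    "u": {("u", "a"), ("v", "b")},
    "v": {("u", "a")},
}
t_pr = make_t_pr(t_non)

# Additivity on disjoint sets and normalization on the whole space.
c1, c2 = {("u", "a")}, {("v", "b"), ("v", "a")}
additive = t_pr("u", c1 | c2) == t_pr("u", c1) + t_pr("u", c2)
normalized = all(t_pr(x, set(product(states, alphabet))) == 1 for x in states)
print(additive, normalized)
```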

This interpretation allows for an extension of finite- 
state machines to machines given by general Markov 
transition kernels that are not restricted to finite
symbols, for example. Here, we allow the internal states to
be described by an arbitrary measurable space $(Q, \mathcal{Q})$.
Again, $Q$ is the set of internal states or, in terms of
probability theory, the set of (internal) elementary events. The
σ-algebra $\mathcal{Q}$ represents all internal events of interest.
The output is modeled by a measurable space
$(\mathsf{A}, \mathcal{D})$, too. A machine is now considered to be
a Markov transition kernel:

$$T : Q \times (\mathcal{Q} \otimes \mathcal{D}) \to [0, 1], \qquad (x, C) \mapsto T(x, C) .$$

More precisely, $T$ is assumed to satisfy the following
conditions:

1. For all $x \in Q$, the function $T(x, \cdot)$ is a probability
distribution on $\mathcal{Q} \otimes \mathcal{D}$.

2. For all $C \in \mathcal{Q} \otimes \mathcal{D}$, the function
$T(\cdot, C)$ is $\mathcal{Q}$-measurable.

FIG. 3: The finite-dimensional marginals $P^{\mu,T}_n$ on
$(\mathsf{A}^n, \mathcal{D}^n)$.

We should point out that the well established notion 
of machines that manipulate finitely many (or a count- 
able number of) symbols may seem more appropriate 
for implementations in physical systems than our broad 
approach to computation using general Markov transi- 
tion kernels. Putting the natural ideas of computation 
into the probabilistic setting, however, allows us to em- 
ploy measure-theoretic concepts and techniques. This 
approach turns out to be very useful in understanding 
the relations between the probabilistic nature of the
observed processes and the underlying internal
computational structures. In particular, problems on
minimality properties of machines can be handled in an 
efficient way and for a broader class of processes than 
those over discrete symbols. 

Given a Markov transition kernel $T$ from $(Q, \mathcal{Q})$ to
$(Q \times \mathsf{A}, \mathcal{Q} \otimes \mathcal{D})$, we consider
it as a temporal "map", as illustrated in Fig. 2. In order to specify
observable stochastic processes in $(\mathsf{A}, \mathcal{D})$, we
consider an initial distribution $\mu$ on $(Q, \mathcal{Q})$ and
measurable sets $B_1, \ldots, B_n \in \mathcal{D}$. The
finite-dimensional marginals $P^{\mu,T}_n$ on
$(\mathsf{A}^n, \mathcal{D}^n)$ are obtained by iteration of $T$, as
shown in Fig. 3. This suggests the following expression for the
finite-dimensional marginals of the observed stochastic process:

$$P^{\mu,T}_n(B_1 \times \cdots \times B_n) := \int_Q \int_{Q \times B_1} \cdots \int_{Q \times B_n} T(x_{n-1}, d(x_n, y_n)) \cdots T(x_0, d(x_1, y_1)) \, \mu(dx_0) .$$

(Throughout the following $d(x, y)$ denotes the differential
of two variables. This notation should not be confused
with a distance measure between $x$ and $y$.)
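
For finite $Q$ and finite alphabet the iterated integral reduces to a finite sum, which can be evaluated by a forward recursion over the internal states. The sketch below illustrates this under the assumption of a finite generator stored as nested dictionaries; the generator `T`, the initial distribution `mu`, and the function name `marginal` are hypothetical.

```python
from itertools import product

def marginal(T, mu, word):
    """P_n^{mu,T}({s_1} x ... x {s_n}) for a finite generator.

    T[x][(y, s)] is the probability of moving from x to y while emitting s;
    mu[x] is the initial distribution over internal states.
    """
    weights = dict(mu)  # unnormalized weight of each current internal state
    for s in word:
        nxt = {}
        for x, w in weights.items():
            for (y, out), p in T[x].items():
                if out == s:  # keep only transitions emitting the required symbol
                    nxt[y] = nxt.get(y, 0.0) + w * p
        weights = nxt
    return sum(weights.values())

# Hypothetical generator: state "a" emits 0 or 1 with equal probability;
# after emitting 1 the machine is in "b" and must emit 0 next.
T = {
    "a": {("a", 0): 0.5, ("b", 1): 0.5},
    "b": {("a", 0): 1.0},
}
mu = {"a": 1.0}

p_10 = marginal(T, mu, (1, 0))
p_11 = marginal(T, mu, (1, 1))
total = sum(marginal(T, mu, w) for w in product((0, 1), repeat=3))
print(p_10, p_11, total)
```

The marginals over all words of a fixed length sum to one, as required of a probability distribution on $(\mathsf{A}^n, \mathcal{D}^n)$.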

Proposition 2.1. Up to equivalence, there is exactly one
stochastic process $Y_n$, $n = 1, 2, \ldots$, in
$(\mathsf{A}, \mathcal{D})$, such that for all $n \in \mathbb{N}$ and
all $B_i \in \mathcal{D}$, $i = 1, \ldots, n$,

$$\Pr\{Y_1 \in B_1, \ldots, Y_n \in B_n\} = P^{\mu,T}_n(B_1 \times \cdots \times B_n) . \qquad (4)$$

We can identify this process, or more precisely, the class
of corresponding equivalent processes, with a probability
distribution $P^{\mu,T}$ on
$(\mathsf{A}^{\mathbb{N}}, \mathcal{D}^{\mathbb{N}})$.

Proof. This follows from Kolmogorov's extension
theorem. □

Definition 2.2. We call a Markov transition kernel $T$
from $(Q, \mathcal{Q})$ to
$(Q \times \mathsf{A}, \mathcal{Q} \otimes \mathcal{D})$ a generator
and denote it by $[(Q, \mathcal{Q}), T, (\mathsf{A}, \mathcal{D})]$
or simply by $T$. We say that a stochastic process
$(Y_n)_{n \in \mathbb{N}}$ in $(\mathsf{A}, \mathcal{D})$ is generated
by $T$ if there exists a probability distribution $\mu$ on
$(Q, \mathcal{Q})$ such that Eq. (4) is satisfied for all
$n \in \mathbb{N}$ and all $B_i \in \mathcal{D}$, $i = 1, \ldots, n$.

Given a stochastic process $Y = (Y_n)_{n \in \mathbb{N}}$, a natural
question is whether there always exists a generator that
generates $Y$. The following trivial shift ansatz shows
that this is indeed the case.

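Example 2.3 (Shift Generator): We set $Q := \mathsf{A}^{\mathbb{N}}$
and $\mathcal{Q} := \mathcal{D}^{\mathbb{N}}$. Consider the shift map
$s : Q \to Q$, where

$$x = (y_n)_{n \in \mathbb{N}} \mapsto s(x) = (y_{n+1})_{n \in \mathbb{N}} ,$$

and the projection onto the first coordinate $\pi : Q \to \mathsf{A}$,
where

$$x = (y_n)_{n \in \mathbb{N}} \mapsto \pi(x) = y_1 .$$

Furthermore, we define

$$T(x, A \times B) := \begin{cases} 1, & \text{if } s(x) \in A \text{ and } \pi(s(x)) \in B \\ 0, & \text{otherwise} . \end{cases}$$

Now, a stochastic process $Y = (Y_n)_{n \in \mathbb{N}}$ in
$(\mathsf{A}, \mathcal{D})$ can be identified with a probability
distribution $\mu$ on
$(Q, \mathcal{Q}) = (\mathsf{A}^{\mathbb{N}}, \mathcal{D}^{\mathbb{N}})$.
It is easy to prove that $T$ generates $Y$ by verifying Eq. (4) with
initial distribution $\mu$. □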
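The shift generator (Example 2.3) is maximal in the
sense that it generates all processes in
$(\mathsf{A}, \mathcal{D})$. For an arbitrary generator $T$, we
consider the map

$$G_T : \mathcal{P}(Q, \mathcal{Q}) \to \mathcal{P}(\mathsf{A}^{\mathbb{N}}, \mathcal{D}^{\mathbb{N}}) , \qquad \mu \mapsto G_T(\mu) := P^{\mu,T} .$$

(Throughout, for a general measurable space $(X, \mathcal{X})$,
$\mathcal{P}(X, \mathcal{X})$ denotes the set of probability measures
on $(X, \mathcal{X})$.) The image $\operatorname{im}(G_T)$ of $G_T$ is
the set of processes that are generated by $T$. Here, we mainly focus
on the following problem.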

Problem Statement 2.4: Given a generator $T$,
can we find a substitute $T'$ for $T$, which, on the one
hand, generates the same set of processes, that is
$\operatorname{im}(G_T) = \operatorname{im}(G_{T'})$; and, on the
other, is minimal in some sense?

From Eq. (4) it follows directly that $G_T$ is affine in the
sense that for all $\mu_1, \mu_2 \in \mathcal{P}(Q, \mathcal{Q})$ and
all $0 < t < 1$,

$$G_T((1 - t)\mu_1 + t\mu_2) = (1 - t)\, G_T(\mu_1) + t\, G_T(\mu_2) . \qquad (5)$$

This implies that $\operatorname{im}(G_T)$ is a convex set, and we
have the following constraint on the solution of Problem 2.4:
The set $\operatorname{ext}(\operatorname{im}(G_T))$ of the extreme
points of $\operatorname{im}(G_T)$ represents a "lower bound" for the
set $Q$ of internal states. More precisely, we have the following
onto mapping $Q \to \operatorname{ext}(\operatorname{im}(G_T))$:

$$x \mapsto \delta_x \mapsto G_T(\delta_x) .$$

Thus, we cannot expect to have a notion of minimality
that reduces the internal states more than given by the
extreme points of $\operatorname{im}(G_T)$.

However, identifying internal states $x_1$ and $x_2$ if
$G_T(\delta_{x_1}) = G_T(\delta_{x_2})$ leads to a partition of $Q$
into equivalence classes that are the analogs of the causal
states in computational mechanics. The corresponding 
canonical projection of internal states to their equiva- 
lence classes is called causal-state reduction, which is in- 
tended to reduce the internal structure in such a way that 
a given observed stochastic process is still generated by 
the reduced generator. 

This is different from the intention stated in Problem 
2.4, which is to reduce a given generator without affecting 
the whole set of observable stochastic processes. We solve 
this problem by applying reductions within a natural
category of generators. The morphisms of this category will
be introduced in Section III. Based on the results there,
we present our reduction procedures in Section IV. We
leave to the future discussing causal-state reduction in
terms of morphisms in a larger category than the one
studied here.

III. TRANSFORMATION RULES FOR
GENERATORS

We interpret generators as objects of a category and
define the morphisms between these objects in the
following way: Let
$[(Q_i, \mathcal{Q}_i), T_i, (\mathsf{A}_i, \mathcal{D}_i)]$,
$i = 1, 2$, be two generators. A morphism $T_1 \to T_2$ consists of a
pair $(f, g)$ of measurable maps $f : Q_1 \to Q_2$ and
$g : \mathsf{A}_1 \to \mathsf{A}_2$ such that for all $x \in Q_1$,
$A \in \mathcal{Q}_2$, and $B \in \mathcal{D}_2$ the following
commutativity rule holds:

$$T_2(f(x), A \times B) = T_1(x, f^{-1}(A) \times g^{-1}(B)) . \qquad (6)$$

The diagram in Fig. 4 illustrates this commutativity.
With the product map

$$(f \times g) : Q_1 \times \mathsf{A}_1 \to Q_2 \times \mathsf{A}_2 , \qquad (x, y) \mapsto (f(x), g(y)) ,$$

we can rewrite Eq. (6) as

$$T_2(f(x), A \times B) = T_1(x, (f \times g)^{-1}(A \times B)) .$$

Thus, the property of Eq. (6) is equivalent to

$$T_2(f(x), C) = T_1(x, (f \times g)^{-1}(C)) , \qquad (7)$$

with $C \in \mathcal{Q}_2 \otimes \mathcal{D}_2$. Here, one has to use
the fact that two probability measures are equal if they coincide on
an intersection-closed system of measurable sets that generates the
underlying σ-algebra. Rewriting (7) gives us

$$T_2(f(x), \cdot) = (f \times g)_*(T_1(x, \cdot)) , \qquad (8)$$

where $(f \times g)_*(\mu)$ denotes the $(f \times g)$-image of a
probability distribution $\mu$.
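
For finite generators, Eq. (8) can be verified directly by pushing each row of $T_1$ forward along $(f \times g)$ and comparing it with the corresponding row of $T_2$. The following sketch illustrates this; the generators `T1` and `T2`, the maps `f` and `g`, and the function names are hypothetical and chosen only for illustration.

```python
def pushforward(row, f, g):
    """(f x g)-image of a distribution over (state, symbol) pairs."""
    image = {}
    for (y, s), p in row.items():
        key = (f[y], g[s])
        image[key] = image.get(key, 0.0) + p
    return image

def is_transition_preserving(T1, T2, f, g, tol=1e-12):
    """Check Eq. (8): T2(f(x), .) is the (f x g)-image of T1(x, .) for all x."""
    for x, row in T1.items():
        image, target = pushforward(row, f, g), T2[f[x]]
        keys = set(image) | set(target)
        if any(abs(image.get(k, 0.0) - target.get(k, 0.0)) > tol for k in keys):
            return False
    return True

# Hypothetical example: "a1" and "a2" behave alike and are merged by f.
T1 = {
    "a1": {("a2", 0): 0.5, ("b", 1): 0.5},
    "a2": {("a1", 0): 0.5, ("b", 1): 0.5},
    "b": {("a1", 0): 1.0},
}
T2 = {
    "a": {("a", 0): 0.5, ("b", 1): 0.5},
    "b": {("a", 0): 1.0},
}
f = {"a1": "a", "a2": "a", "b": "b"}
g = {0: 0, 1: 1}  # identity on the output alphabet

ok = is_transition_preserving(T1, T2, f, g)
bad = is_transition_preserving(T1, T2, {"a1": "a", "a2": "b", "b": "b"}, g)
print(ok, bad)
```

The second call uses a map that sends "a2" to the wrong target state, so the commutativity rule fails.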

In the following, a morphism $(f, g)$ is called a
transition-preserving map. In order to define the
composition of transition-preserving maps, we consider
three generators
$[(Q_i, \mathcal{Q}_i), T_i, (\mathsf{A}_i, \mathcal{D}_i)]$,
$i = 1, 2, 3$, and transition-preserving maps
$(f_i, g_i) : T_i \to T_{i+1}$, $i = 1, 2$.
Now define the composition as

$$(f_2, g_2) \circ (f_1, g_1) := (f_2 \circ f_1, g_2 \circ g_1) .$$

We prove that this composition is a transition-preserving
map $T_1 \to T_3$ by verifying Eq. (6):

$$\begin{aligned}
T_3((f_2 \circ f_1)(x), A \times B) &= T_3(f_2(f_1(x)), A \times B) \\
&= T_2(f_1(x), f_2^{-1}(A) \times g_2^{-1}(B)) \\
&= T_1(x, f_1^{-1}(f_2^{-1}(A)) \times g_1^{-1}(g_2^{-1}(B))) \\
&= T_1(x, (f_2 \circ f_1)^{-1}(A) \times (g_2 \circ g_1)^{-1}(B)) .
\end{aligned}$$

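Proposition 3.1. Let
$[(Q_i, \mathcal{Q}_i), T_i, (\mathsf{A}_i, \mathcal{D}_i)]$,
$i = 1, 2$, be two generators, let $(f, g)$ be a
transition-preserving map from $T_1$ to $T_2$, and let $\mu$ be a
probability distribution on $(Q_1, \mathcal{Q}_1)$. Then, denoting the
$f$-image of $\mu$ by $f_*(\mu)$, for all
$B_1, \ldots, B_n \in \mathcal{D}_2$,

$$P^{f_*(\mu),T_2}_n(B_1 \times \cdots \times B_n) = P^{\mu,T_1}_n(g^{-1}(B_1) \times \cdots \times g^{-1}(B_n)) .$$

Proof. With the general transformation rule for integrals
we have

$$\begin{aligned}
P^{f_*(\mu),T_2}_n&(B_1 \times \cdots \times B_n) \\
&= \int_{Q_2} \int_{Q_2 \times B_1} \cdots \int_{Q_2 \times B_n} T_2(x'_{n-1}, d(x'_n, y'_n)) \cdots T_2(x'_0, d(x'_1, y'_1)) \, f_*(\mu)(dx'_0) \\
&= \int_{Q_1} \int_{Q_2 \times B_1} \cdots \int_{Q_2 \times B_n} T_2(x'_{n-1}, d(x'_n, y'_n)) \cdots T_2(f(x_0), d(x'_1, y'_1)) \, \mu(dx_0) \quad \text{(transformation rule)} \\
&= \int_{Q_1} \int_{Q_2 \times B_1} \cdots \int_{Q_2 \times B_n} T_2(x'_{n-1}, d(x'_n, y'_n)) \cdots (f \times g)_*(T_1(x_0, \cdot))(d(x'_1, y'_1)) \, \mu(dx_0) \\
&= \int_{Q_1} \int_{Q_1 \times g^{-1}(B_1)} \cdots \int_{Q_2 \times B_n} T_2(x'_{n-1}, d(x'_n, y'_n)) \cdots T_1(x_0, d(x_1, y_1)) \, \mu(dx_0) \quad \text{(transformation rule)} \\
&= \int_{Q_1} \int_{Q_1 \times g^{-1}(B_1)} \cdots \int_{Q_1 \times g^{-1}(B_n)} T_1(x_{n-1}, d(x_n, y_n)) \cdots T_1(x_0, d(x_1, y_1)) \, \mu(dx_0) \\
&= P^{\mu,T_1}_n(g^{-1}(B_1) \times \cdots \times g^{-1}(B_n)) .
\end{aligned}$$

□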

Theorem 3.2. If $T_1$ generates $(Y_n)_{n \in \mathbb{N}}$ and
$(f, g)$ is a transition-preserving map from $T_1$ to $T_2$, then
$T_2$ generates $(g \circ Y_n)_{n \in \mathbb{N}}$.

Proof. This statement follows directly from Proposition
3.1. □
Theorem 3.2 has important and direct implications 
for two special cases. In the first case, we fix g as the 
identity map and, in the second case, we fix $f$ as the
identity map. In these cases, without reference to the
identity maps, $f$ and $g$ are called transition-preserving.
The implications are stated in the following two corol- 
laries. 

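Corollary 3.3. Let
$[(Q_i, \mathcal{Q}_i), T_i, (\mathsf{A}, \mathcal{D})]$, $i = 1, 2$,
be two generators, and let $f$ be a transition-preserving map.
Then

$$G_{T_1} = G_{T_2} \circ f_* .$$

In particular, this implies

$$\operatorname{im}(G_{T_1}) \subseteq \operatorname{im}(G_{T_2}) ,$$

where the equality holds if $f_*$ is onto.

Corollary 3.4. Let
$[(Q, \mathcal{Q}), T_1, (\mathsf{A}_1, \mathcal{D}_1)]$ be a
generator of a stochastic process $(Y_n)_{n \in \mathbb{N}}$ in
$(\mathsf{A}_1, \mathcal{D}_1)$, and let
$g : (\mathsf{A}_1, \mathcal{D}_1) \to (\mathsf{A}_2, \mathcal{D}_2)$
be a measurable map. Then
$T_2 : Q \times (\mathcal{Q} \otimes \mathcal{D}_2) \to [0, 1]$ with

$$T_2(x, A \times B) := T_1(x, A \times g^{-1}(B))$$

is a generator of the stochastic process
$(g \circ Y_n)_{n \in \mathbb{N}}$ in $(\mathsf{A}_2, \mathcal{D}_2)$.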



IV. REDUCTIONS OF GENERATORS 

After having derived some basic transformation rules
for generators in Section III, we are now ready to
concentrate on the main problem, namely to maximally
reduce a given generator $T$ while keeping the set
$\operatorname{im}(G_T)$ of generated processes unchanged. The
solution of this problem is given by Theorem 4.5 below and is based
on a combination of reduction methods, which we present in
this section. First, we attempt to reduce the σ-algebra
$\mathcal{Q}$ of internal events as much as possible, by considering
only those events in $\mathcal{Q}$ that are necessary for maintaining
the output process unchanged. The following theorem
formalizes this idea.

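Theorem (Internal-Event Reduction) 4.1. Let
$[(Q, \mathcal{Q}), T, (\mathsf{A}, \mathcal{D})]$ be a generator.
Then there exists a smallest σ-subalgebra $\sigma_Q(T)$ of
$\mathcal{Q}$ with the property that for all
$C \in \sigma_Q(T) \otimes \mathcal{D}$, $T(\cdot, C)$ is
$\sigma_Q(T)$-measurable. The generator
$[(Q, \sigma_Q(T)), \tilde{T}, (\mathsf{A}, \mathcal{D})]$ with the
restriction
$\tilde{T} := T|_{Q \times (\sigma_Q(T) \otimes \mathcal{D})}$ then
satisfies

$$\operatorname{im}(G_{\tilde{T}}) = \operatorname{im}(G_T) .$$

FIG. 4: Commutativity for generators of equivalent observed
processes.

Proof. Let $\mathcal{A}_i$, $i \in I$, be the family of all
σ-subalgebras of $\mathcal{Q}$ that satisfy the following condition:
for all $C \in \mathcal{A}_i \otimes \mathcal{D}$, $T(\cdot, C)$ is
$\mathcal{A}_i$-measurable. Now define

$$\sigma_Q(T) := \bigcap_{i \in I} \mathcal{A}_i .$$

Then for $C \in \sigma_Q(T) \otimes \mathcal{D}$, $T(\cdot, C)$ is
$\mathcal{A}_i$-measurable for all $i \in I$, and therefore also
$\sigma_Q(T)$-measurable.
For the reason that trivially

$$\tilde{T}(\mathrm{id}_Q(x), A \times D) = T(x, \mathrm{id}_Q^{-1}(A) \times \mathrm{id}_{\mathsf{A}}^{-1}(D)) ,$$

Corollary 3.3 implies that $\tilde{T}$ and $T$ generate the same
set of stochastic processes. □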
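
In the finite case a σ-subalgebra of the power set corresponds to a partition of $Q$, so the smallest σ-subalgebra corresponds to the coarsest partition whose blocks $b$ make $x \mapsto T(x, b \times \{s\})$ constant on blocks. The sketch below computes such a partition by iterative refinement, starting from the trivial partition. This is an illustrative assumption about how the theorem specializes to finite generators, not a construction taken from the text; the function name and the example generator are hypothetical.

```python
def internal_event_partition(T, states):
    """Coarsest partition with x -> T(x, block x {s}) constant on each block."""
    block_of = {x: 0 for x in states}  # start from the trivial partition {Q}
    while True:
        signatures = {}
        for x in states:
            # Probability of jumping into each (block, symbol) cell from x.
            cell_mass = {}
            for (y, s), p in T[x].items():
                cell = (block_of[y], s)
                cell_mass[cell] = cell_mass.get(cell, 0.0) + p
            signatures[x] = (
                block_of[x],  # keeps the new partition a refinement of the old one
                tuple(sorted((c, round(m, 12)) for c, m in cell_mass.items())),
            )
        labels = {}
        refined = {x: labels.setdefault(signatures[x], len(labels)) for x in states}
        if len(labels) == len(set(block_of.values())):
            blocks = {}
            for x in states:
                blocks.setdefault(refined[x], []).append(x)
            return sorted(sorted(b) for b in blocks.values())
        block_of = refined

# Hypothetical generator in which "a1" and "a2" are observationally alike.
T = {
    "a1": {("a2", 0): 0.5, ("b", 1): 0.5},
    "a2": {("a1", 0): 0.5, ("b", 1): 0.5},
    "b": {("a1", 0): 1.0},
}
blocks = internal_event_partition(T, ["a1", "a2", "b"])
print(blocks)
```

Note that this step only coarsens the events: the states "a1" and "a2" remain distinct points of $Q$, but no retained event separates them.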

Theorem 4.1 guarantees the existence of a minimal
sufficient σ-subalgebra of $\mathcal{Q}$. Now we provide a way to
calculate it explicitly in the case where we have a deterministic
internal dynamics $f : Q \to Q$ and a visible process given
by a measurement $g : Q \to \mathsf{A}$. This case generalizes the
shift generator of Example 2.3.



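Theorem 4.2. Let $(Q, \mathcal{Q})$ and $(\mathsf{A}, \mathcal{D})$
be two measurable spaces, and let $f : Q \to Q$ and
$g : Q \to \mathsf{A}$ be two measurable maps. Consider the generator
$[(Q, \mathcal{Q}), T, (\mathsf{A}, \mathcal{D})]$ defined by

$$T(x, A \times B) := 1_{\{f \in A,\; g \circ f \in B\}}(x) = \begin{cases} 1, & \text{if } f(x) \in A \text{ and } g(f(x)) \in B \\ 0, & \text{otherwise} . \end{cases}$$

Then

$$\sigma_Q(T) = \sigma(g \circ f, g \circ f^2, \ldots) . \qquad (9)$$

Proof. We prove inclusion in each direction separately.

1. We establish that
$\sigma_Q(T) \subseteq \sigma(g \circ f, g \circ f^2, \ldots)$ by
showing that for all
$C \in \sigma(g \circ f, g \circ f^2, \ldots) \otimes \mathcal{D}$,
$T(\cdot, C)$ is measurable with respect to
$\sigma(g \circ f, g \circ f^2, \ldots)$. From

$$\sigma(g \circ f, g \circ f^2, \ldots) \otimes \mathcal{D} = \sigma\big((g \circ f, g \circ f^2, \ldots) \times \mathrm{id}_{\mathsf{A}}\big)$$

we know that there exists a measurable set
$C' \in \mathcal{D}^{\mathbb{N}} \otimes \mathcal{D}$ with

$$C = \big((g \circ f, g \circ f^2, \ldots) \times \mathrm{id}_{\mathsf{A}}\big)^{-1}(C') .$$

This implies

$$\begin{aligned}
T(\cdot, C) &= 1_{\{(f,\, g \circ f) \in C\}} \\
&= 1_{\{(f,\, g \circ f) \in ((g \circ f, g \circ f^2, \ldots) \times \mathrm{id}_{\mathsf{A}})^{-1}(C')\}} \\
&= 1_{\{((g \circ f, g \circ f^2, \ldots) \times \mathrm{id}_{\mathsf{A}}) \circ (f,\, g \circ f) \in C'\}} \\
&= 1_{\{((g \circ f^2, g \circ f^3, \ldots),\, g \circ f) \in C'\}} .
\end{aligned}$$

Thus, $T(\cdot, C)$ is measurable with respect to
$\sigma(g \circ f, g \circ f^2, \ldots)$.

2. Now we prove that
$\sigma_Q(T) \supseteq \sigma(g \circ f, g \circ f^2, \ldots)$
by applying an induction argument to show that
$\sigma(g \circ f^k) \subseteq \sigma_Q(T)$ for all $k = 1, 2, \ldots$:

(a) "$k = 1$": Let $A$ be a $(g \circ f)$-measurable set.
Then there exists a measurable set $B \in \mathcal{D}$
with $A = (g \circ f)^{-1}(B)$. From
$Q \times B \in \sigma_Q(T) \otimes \mathcal{D}$ and

$$1_A = 1_{\{(f,\, g \circ f) \in Q \times B\}} = T(\cdot, Q \times B) ,$$

it follows that $A$ is $\sigma_Q(T)$-measurable.

(b) "$k \to k + 1$": We assume that $\sigma(g \circ f^k)$ is a
σ-subalgebra of $\sigma_Q(T)$, and we have to show that
this is also true for $\sigma(g \circ f^{k+1})$. To this end,
we choose a measurable set $A \in \sigma(g \circ f^{k+1})$.
There exists a measurable set $B \in \mathcal{D}$ with
$A = (g \circ f^{k+1})^{-1}(B)$, and we have

$$1_A = 1_{\{f \in (g \circ f^k)^{-1}(B)\}} = T(\cdot, (g \circ f^k)^{-1}(B) \times \mathsf{A}) .$$

This implies $A \in \sigma_Q(T)$, because according
to the induction hypothesis
$(g \circ f^k)^{-1}(B) \in \sigma_Q(T)$.

□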

Examples 4.3.

1. Complete Randomness. Consider a probability
space $(Q, \mathcal{Q}, \mu)$. This defines the following generator
$[(Q, \mathcal{Q}), T, (Q, \mathcal{Q})]$, which is completely random
in the sense that the next internal state, which coincides
with the next output state, is independent
of the current internal state:

$$T : Q \times (\mathcal{Q} \otimes \mathcal{Q}) \to [0, 1] ,$$

and

$$T(x, A \times B) := \mu(A \cap B) .$$

In this case

$$\sigma_Q(T) = \{\emptyset, Q\} .$$

In other words, as expected, the process has no
memory. Only a single internal event is required
to generate the process, and the independent, identically
distributed process with marginal $\mu$ is the only
process in $\operatorname{im}(G_T)$.

2. Rotation of the Unit Circle. Consider the unit
circle $K = \{x \in \mathbb{C} : |x| = 1\}$ and its upper half
$A_1 = \{e^{i\varphi} : \varphi \in [0, \pi)\}$ and its lower half
$A_2 = \{e^{i\varphi} : \varphi \in [\pi, 2\pi)\}$. With a number
$a \in K$, we construct the generator $T$ according to Theorem
4.2 using $f(x) = ax$ and $g(x) = k$ for $x \in A_k$.
There are two qualitatively different cases:

(a) Assume that $a$ is a root of unity. Then there
is a natural number $p \geq 1$ with $a^p = 1$. This
implies $f^p = \mathrm{id}_K$ and, therefore,

$$\sigma_Q(T) = \sigma(g \circ f, g \circ f^2, \ldots) = \sigma(g \circ f, g \circ f^2, \ldots, g \circ f^{p}) .$$

Since $g$ has just two different values, $\sigma_Q(T)$
is finite in this case and we have an effective
internal-event reduction.

(b) Assume that $a$ is not a root of unity. Then
$\sigma_Q(T)$ is the Borel algebra of the unit circle,
and we have no internal-event reduction.

□
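
The root-of-unity case can be explored numerically. The sketch below assumes the specific rotation $a = e^{2\pi i/p}$ and measures angles in turns, so that $f$ adds $1/p$ and $g$ is 1 on $[0, 1/2)$ and 2 on $[1/2, 1)$. It counts the distinct signatures $(g(f^k(x)))_{k=1,\ldots,p}$, which are the atoms of the finite σ-algebra in case (a). The function names and the sampling grid are hypothetical choices for illustration.

```python
from fractions import Fraction

def g(theta):
    """Output symbol: 1 on the upper half [0, 1/2), 2 on the lower half [1/2, 1)."""
    return 1 if theta % 1 < Fraction(1, 2) else 2

def count_atoms(p):
    """Count distinct signatures (g(f^k(x)))_{k=1..p} for rotation by 1/p of a turn."""
    n = 4 * p  # grid fine enough to sample every arc between boundary points
    signatures = set()
    for j in range(n):
        # Midpoints of the grid cells, so no sample lies on a boundary point.
        theta = Fraction(2 * j + 1, 2 * n)
        signatures.add(tuple(g(theta + Fraction(k, p)) for k in range(1, p + 1)))
    return len(signatures)

atoms_4 = count_atoms(4)
atoms_3 = count_atoms(3)
print(atoms_4, atoms_3)
```

Exact rational arithmetic avoids rounding problems near the boundaries between the two halves of the circle.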

In addition to the reduction method given by Theorem 4.1,
we now consider another way to reduce
the generator's internal structure. Given a generator
$[(Q, \mathcal{Q}), T, (\mathsf{A}, \mathcal{D})]$, we identify each
two elements $x_1, x_2 \in Q$ if $T(x_1, \cdot) = T(x_2, \cdot)$. The
equivalence class of $x$ is denoted by $[x]$. Furthermore, we define

$$[Q] := \{[x] : x \in Q\}$$

and

$$[\mathcal{Q}] := \{A' \subseteq [Q] : [\cdot]^{-1}(A') \in \mathcal{Q}\} .$$

The σ-algebra $[\mathcal{Q}]$ is just the terminal algebra of the
canonical projection $[\cdot] : x \mapsto [x]$. It is easy to see that
the following transition kernel
$[T] : [Q] \times ([\mathcal{Q}] \otimes \mathcal{D}) \to [0, 1]$
is well defined:

$$[T]([x], A' \times B) := T(x, [\cdot]^{-1}(A') \times B) .$$

Theorem (Internal-State Reduction) 4.4. 

Let [(Q, Q), T, (A, 23)] be a generator. Then 
[([Q], [Q]), [T], (A, P)] is a generator, which gener- 
ates the same set of processes in (A, 2?) as T, that 
is, 

im(G[T]) = im^Gr) ■ 

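Proof. We show that $[T]$ is a Markov transition kernel in
two stages.

1. We fix $[x]$ and prove that $[T]([x], \cdot)$ is a probability
measure:

$$
\begin{aligned}
{[T]}\Bigl([x], \biguplus_{n=1}^{\infty} C_n\Bigr)
&= T\Bigl(x, ([\cdot] \times \mathrm{id}_A)^{-1}\Bigl(\biguplus_{n=1}^{\infty} C_n\Bigr)\Bigr) \\
&= T\Bigl(x, \biguplus_{n=1}^{\infty} ([\cdot] \times \mathrm{id}_A)^{-1}(C_n)\Bigr) \\
&= \sum_{n=1}^{\infty} T\bigl(x, ([\cdot] \times \mathrm{id}_A)^{-1}(C_n)\bigr) \\
&= \sum_{n=1}^{\infty} [T]([x], C_n),
\end{aligned}
$$

and

$$
\begin{aligned}
{[T]}([x], [Q] \times A)
&= T\bigl(x, ([\cdot] \times \mathrm{id}_A)^{-1}([Q] \times A)\bigr) \\
&= T(x, Q \times A) \\
&= 1.
\end{aligned}
$$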

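2. Now we fix $C \in [\mathcal{Q}] \otimes \mathcal{D}$ and prove that $[T](\cdot, C)$
is $[\mathcal{Q}]$-measurable. To this end, it is sufficient to
prove that for all $\varepsilon$ with $0 < \varepsilon < 1$, the set
$\{[T](\cdot, C) < \varepsilon\}$ is an element of $[\mathcal{Q}]$ or, equivalently,
$[\cdot]^{-1}(\{[T](\cdot, C) < \varepsilon\}) \in \mathcal{Q}$. This is shown as follows:

$$
\begin{aligned}
{[\cdot]}^{-1}\bigl(\{[T](\cdot, C) < \varepsilon\}\bigr)
&= [\cdot]^{-1}\bigl(\{[x] \in [Q] : [T]([x], C) < \varepsilon\}\bigr) \\
&= \{x \in Q : [T]([x], C) < \varepsilon\} \\
&= \bigl\{x \in Q : T\bigl(x, ([\cdot] \times \mathrm{id}_A)^{-1}(C)\bigr) < \varepsilon\bigr\} \\
&\in \mathcal{Q}.
\end{aligned}
$$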

□ 

Combining the reduction methods provided by Theorem 4.1
and Theorem 4.4, we can reduce every generator
to a minimal generator. This statement is specified in
the following theorem.

Theorem (Solution of Problem 2.4) 4.5.

Let $[(Q, \mathcal{Q}), T, (A, \mathcal{D})]$ be a generator, and let
$[(Q', \mathcal{Q}'), T', (A, \mathcal{D})]$ be the generator obtained from
$T$ by applying first the reduction method of Theorem 4.1
and then the method of Theorem 4.4. Then $T'$ satisfies

$$\mathrm{im}(G_{T'}) = \mathrm{im}(G_T)$$

and is minimal in the sense that given another generator
$[(Q'', \mathcal{Q}''), T'', (A, \mathcal{D})]$ with $\mathrm{im}(G_{T''}) = \mathrm{im}(G_{T'})$, every
transition-preserving map $f$ from $T'$ to $T''$ is injective.
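Proof. Again there are two steps.

1. We prove

$$\sigma(f \circ [\cdot]) = \sigma_Q(T). \qquad (10)$$

(a) "$\subseteq$": This inclusion follows directly from the
measurability of

$$(Q, \mathcal{Q}) \xrightarrow{\;[\cdot]\;} ([Q], [\sigma_Q(T)]) = (Q', \mathcal{Q}') \xrightarrow{\;f\;} (Q'', \mathcal{Q}'').$$

(b) "$\supseteq$": Let $C \in \sigma(f \circ [\cdot]) \otimes \mathcal{D}$. We prove
that $T(\cdot, C)$ is $(f \circ [\cdot])$-measurable, from which
$\sigma_Q(T) \subseteq \sigma(f \circ [\cdot])$ follows, because $\sigma_Q(T)$ is
the smallest $\sigma$-algebra with that invariance
property: From

$$
\begin{aligned}
\sigma(f \circ [\cdot]) \otimes \mathcal{D} &= \sigma(f \circ [\cdot]) \otimes \sigma(\mathrm{id}_A) \\
&= \sigma\bigl((f \circ [\cdot]) \times \mathrm{id}_A\bigr),
\end{aligned}
$$

it follows that there exists $C'' \in \mathcal{Q}'' \otimes \mathcal{D}$ with

$$\bigl((f \circ [\cdot]) \times \mathrm{id}_A\bigr)^{-1}(C'') = C.$$

This implies the $(f \circ [\cdot])$-measurability of
$T(\cdot, C)$:

$$
\begin{aligned}
T(x, C) &= T\bigl(x, ((f \circ [\cdot]) \times \mathrm{id}_A)^{-1}(C'')\bigr) \\
&= T''\bigl((f \circ [\cdot])(x), C''\bigr) \\
&= \bigl(T''(\cdot, C'') \circ (f \circ [\cdot])\bigr)(x).
\end{aligned}
$$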

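2. Using Eq. (10), we now prove that $f$ is injective.
Assume $f([x_1]) = f([x_2])$, where $[x_1]$ and $[x_2]$ are equivalence
classes in $Q'$, that is, $[x_1], [x_2] \in [Q]$. In order to
prove injectivity of $f$, we have to show $[x_1] = [x_2]$.
For every $C \in \sigma_Q(T) \otimes \mathcal{D}$, with $C''$ chosen as in step 1(b),

$$
\begin{aligned}
T(x_1, C) &= T\bigl(x_1, ((f \circ [\cdot]) \times \mathrm{id}_A)^{-1}(C'')\bigr) \\
&= T''(f([x_1]), C'') \\
&= T''(f([x_2]), C'') \\
&= T\bigl(x_2, ((f \circ [\cdot]) \times \mathrm{id}_A)^{-1}(C'')\bigr) \\
&= T(x_2, C).
\end{aligned}
$$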
□
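For finite generators the two-step reduction can be sketched in a few lines of Python. Theorem 4.1 is stated outside this section, so the first step below rests on an assumption that should be read as such: for a finite state space we take $\sigma_Q(T)$ to be generated by the coarsest partition of $Q$ on whose blocks $x \mapsto T(x, A \times \{b\})$ is constant for every block $A$ and symbol $b$ (our reading of the "smallest $\sigma$-algebra with that invariance property" used in the proof). On that reading the second step, identifying states whose rows agree on these events, amounts to taking the blocks as the new states. The function names and the example kernel are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def lumped_row(row, block_of):
    """Row T(x, .) pushed forward to pairs (block of next state, symbol)."""
    out = defaultdict(Fraction)
    for (y, b), p in row.items():
        out[(block_of[y], b)] += p
    return {k: v for k, v in out.items() if v != 0}


def minimal_generator(T):
    """Refine the trivial partition until x -> T(x, A x {b}) is constant on blocks,
    then use the blocks as the states of the reduced kernel."""
    block_of = {x: 0 for x in T}
    while True:
        labels = {}
        refined = {
            x: labels.setdefault(frozenset(lumped_row(T[x], block_of).items()), len(labels))
            for x in sorted(T)
        }
        if len(labels) == len(set(block_of.values())):
            break  # no block was split: the partition is stable
        block_of = refined
    reduced = {block_of[x]: lumped_row(T[x], block_of) for x in T}
    return block_of, reduced


def word_probability(T, init, word):
    """Probability that the generator T, started from init, emits the given word."""
    dist = dict(init)
    for symbol in word:
        nxt = defaultdict(Fraction)
        for x, w in dist.items():
            for (y, b), p in T[x].items():
                if b == symbol:
                    nxt[y] += w * p
        dist = nxt
    return sum(dist.values(), Fraction(0))


half, third = Fraction(1, 2), Fraction(1, 3)
T = {
    0: {(1, "a"): half, (2, "b"): half},
    1: {(0, "a"): third, (1, "b"): 2 * third},
    2: {(0, "a"): third, (2, "b"): 2 * third},  # differs from state 1 only inside a block
}
block_of, T_min = minimal_generator(T)

init = {0: half, 1: Fraction(1, 4), 2: Fraction(1, 4)}
init_min = defaultdict(Fraction)
for x, w in init.items():
    init_min[block_of[x]] += w

for n in range(1, 5):
    for word in product("ab", repeat=n):
        assert word_probability(T, init, word) == word_probability(T_min, init_min, word)
print(len(T), "states reduced to", len(T_min))
```

In this example states 1 and 2 have different rows as measures on $Q \times A$, so comparing the full rows $T(x, \cdot)$ would not merge them. They become indistinguishable only after the events have been coarsened, which illustrates why the order of the two steps matters.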

Examples (Continuation of Examples 4.3) 4.6.

1. Complete Randomness. Applying the internal-state
reduction leads to an internal state space $Q'$
consisting of one point, namely $Q' = \{Q\}$. The
reduced generator is then given by

$$T'(x, \{Q\} \times B) = \mu(B).$$

2. Rotation of the Unit Circle.

(a) Identifying points according to the internal-state
reduction leads to the grouping of all elements
in a given atom of the finite $\sigma$-algebra
$\sigma_Q(T)$. Thus, in this case we have a finite
transition kernel $T'$ resulting from Theorem 4.5.

(b) In this case, the internal-state reduction leads
to equivalence classes that consist of individual
points, so that effectively there is no reduction.



As pointed out at the end of Section II, our goals
differ from those underlying causal-state reduction in
computational mechanics. Nonetheless, it is not hard
to see the following close relationship: In the situation
of Theorem 4.2, identifying $x_1$ and $x_2$ if and only if
$G_T(\delta_{x_1}) = G_T(\delta_{x_2})$ is equivalent to the identification
of $x_1$ and $x_2$ if and only if $T(x_1, C) = T(x_2, C)$ for all
$C \in \sigma_Q(T)$. The first identification leads to the analogs
of the causal states in computational mechanics and the
second identification is the one used in Theorem 4.5. For
completeness, we conclude this section with the proof of
this relationship.

Corollary 4.7. Let $[(Q, \mathcal{Q}), T, (A, \mathcal{D})]$ be a generator as
in Theorem 4.2, and let $x_1, x_2 \in Q$. Then

$$G_T(\delta_{x_1}) = G_T(\delta_{x_2})$$

is equivalent to

$$T(x_1, C) = T(x_2, C) \quad \text{for all } C \in \sigma_Q(T).$$

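Proof.

$$
\begin{aligned}
& G_T(\delta_{x_1}) = G_T(\delta_{x_2}) \\
\iff\ & g(f^n(x_1)) = g(f^n(x_2)) \quad \text{for all } n = 1, 2, \dots \\
\iff\ & \mathbf{1}_{\{(g \circ f,\, g \circ f^2,\, \dots) \in C'\}}(x_1) = \mathbf{1}_{\{(g \circ f,\, g \circ f^2,\, \dots) \in C'\}}(x_2) \quad \text{for all measurable sets } C' \text{ of output sequences} \\
\iff\ & T(x_1, C) = T(x_2, C) \quad \text{for all } C \in \sigma\{g \circ f,\, g \circ f^2,\, \dots\} \\
\iff\ & T(x_1, C) = T(x_2, C) \quad \text{for all } C \in \sigma_Q(T) \quad \text{(Theorem 4.2)}.
\end{aligned}
$$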

□ 
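The equivalence can be checked by brute force on a finite example. The sketch below makes two assumptions that are not taken verbatim from the text: the state space is finite, and the generator of Theorem 4.2 has the deterministic form $T(x, A \times \{b\}) = 1$ if $f(x) \in A$ and $g(f(x)) = b$ (and $0$ otherwise), which is consistent with the outputs $g(f^n(x))$, $n \geq 1$, appearing in the proof. The map `f`, the output function `g`, and the helper names are hypothetical.

```python
def output_word(x, f, g, horizon):
    """(g(f^n(x)))_{n=1..horizon}: outputs of the deterministic generator started in x."""
    word = []
    for _ in range(horizon):
        x = f[x]
        word.append(g[x])
    return tuple(word)


def same_kernel_on_atoms(x1, x2, f, g, atom_of):
    """Check T(x1, A x {b}) = T(x2, A x {b}) for all atoms A and symbols b.

    For the deterministic kernel assumed here, this holds exactly when
    f(x1) and f(x2) lie in the same atom and carry the same output symbol.
    """
    y1, y2 = f[x1], f[x2]
    return atom_of[y1] == atom_of[y2] and g[y1] == g[y2]


f = [1, 2, 0, 4, 5, 3, 0]  # hypothetical map on Q = {0, ..., 6}
g = [1, 1, 2, 1, 1, 2, 2]  # hypothetical output function with values in {1, 2}
N = len(f)

# Atoms of sigma(g o f, g o f^2, ...): states sharing the same output word.
# On a state space with N points, a horizon of N steps already separates
# every pair of states that some longer horizon would separate.
atom_of = {x: output_word(x, f, g, N) for x in range(N)}

for x1 in range(N):
    for x2 in range(N):
        same_process = atom_of[x1] == atom_of[x2]  # G_T(delta_x1) = G_T(delta_x2)
        assert same_process == same_kernel_on_atoms(x1, x2, f, g, atom_of)
print(len(set(atom_of.values())), "causal-state analogs for", N, "internal states")
```

Both identifications yield the same grouping of the seven states into three classes here; the check is exhaustive only for this one example and is not a substitute for the proof.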



V. DISCUSSION 

After this long development, it will be helpful to dis- 
cuss more informally what was achieved and how to in- 
terpret the results. We began by characterizing the class 
of hidden information sources in a way that respected the 
distinction between a source's internal structure and its 
observed process. That allowed us to define generators 
of stochastic processes as Markov transition kernels and 
to state the problem of observationally equivalent gener- 
ators. We then established how different generators can 
be mapped onto each other while maintaining equiva- 
lence of the observed stochastic process. We showed that 
one can maximally reduce the representation of a source's 
generator under the same constraint. The reduction was 
achieved in two steps: first, internal-event reduction,
which produced the smallest $\sigma$-algebra; second,
internal-state reduction, which collapsed $\sigma$-algebra
components redundant for optimal prediction.

"Prediction" here refers to the hidden internal state 
and to the observed state of the machine in the next 
time step. Within computational mechanics, however, 
predictions are made for the whole future of the observed 
process, which seems more natural than trying to make 
predictions of the hidden states. For the class of genera- 
tors that have the structure of Theorem 4.2 it turns out 
that both approaches are equivalent (see Corollary 4.7). 
We expect this equivalence to be valid for a larger class 
of generators but leave this to future investigations. 

One interpretation of these results is that the seem- 
ingly intractable nonuniqueness of inferring models of 
hidden information sources can be directly addressed. 
There are more constraints on one's choice of representation
than one might think at first blush. The new reductions
and their sometimes-equivalence to ε-machine
representations suggest that there might be a preferred
minimal representation of general stochastic processes: the
ε-machine or some generalization of it. Even if these
minimal models are unachievable when inferring from finite
data, they are nevertheless the goal toward which
modeling should strive. We hoped to show, and partly
illustrated by the examples, that the new formulations
of reductions and their relationship to causal-state
reduction greatly extend the class of processes to which
computational mechanics can be applied.



VI. APPLICATION AREAS 

The developments here properly lie in the domains of 
measure theory and stochastic processes. However, we 
believe the results on reductions are relevant to a num- 
ber of areas outside of those fields. To emphasize this, 
and also to suggest possible directions for future work, we 
shall point out the similarities with some areas and pos- 
sible applications that would follow from the similarities. 
The areas considered are not, by any means, exhaustive. 
The observations are intended only to be suggestive. 
Very generally, theories in statistical physics assume
that a system is Markovian. There is, for example,
little concern about minimal representations. One
consequence of this is that one sees only an indirect interest
in calculating the structural and information-processing
properties of physical systems. Historically, as reflected 
in the invention and use of order parameters, structural 
aspects are what the theorist introduces at the beginning 
of analysis. The difficulty that arises is that the systems 
of genuine interest often produce "order" — behaviors and 
structures — that is not directly determined by the fun- 
damental equations of motion, but only arises over long 
times and large spatial scales. In these cases, one must 
adopt something like the inferential stance to discovering 
the emergent order, rather than assume it at the outset. 
All of which is to say that applying the reductions dis- 
cussed here to problems in statistical mechanics should 
lead to novel and useful notions of structure and to quan- 
titative methods for measuring degrees of structuredness. 

In communication theory hidden information sources 
are called channels. Overwhelmingly, the cases that 
are considered and analyzed and that, more importantly, 
are the basis for the central results of information theory 
assume channels with no memory [29]. Here, though, in effect
we addressed channels with memory in the sense that
the output symbols were not in one-to-one relationship 
to the channel's internal states. Indeed, to the extent the 
set of causal states is nontrivial, then one is confronted 
with memoryful information sources. Looking forward, 
the results on reductions should help in analyzing mem- 
oryful information sources and in quantitatively address- 
ing the size of encoders and decoders under fixed channel 
fidelity. 

VII. CONCLUSION 

The process of model building is sometimes charac- 
terized as equivalent to data compression. While this 
might be true from a pragmatic engineering perspective, 
from the scientific, one must disagree. Model building 
is much more than data compression, especially to the
extent that one attempts to explain and understand hidden
structures and mechanisms. (See, for example, the
discussion in the last section of Ref. yfl) 

Building a good model certainly helps with compress- 
ing the original data, since the predictable components of 
a process that the model captures can be used in encoding 
and decoding to send only the "random" portions. How- 
ever, the goal of modeling in the sciences is understand- 
ing the (possibly hidden) mechanisms and structures — 
elements that help explain observed phenomena and lead 
to new insights about how nature organizes itself. In this, 
minimal models — the theme of the present work — play a 
particularly important role. Not only do small models 
make for more tractable analysis and manipulation, they 
express how a process is structured and, in this, they 
allow for improved scientific understanding. 

Here we addressed the Forward Modeling Problem of 
maximally reducing a given generator while keeping the 
observed process unchanged. Future work will focus on 
the Reverse Modeling Problem, the goal of which is to 
construct a minimal generator based on a distribution of 
measurement sequences alone. We envision a two-step 
approach. In the first, one constructs a possibly large 
but sufficient generator that, in the second step, is re- 
duced using the results developed above. Unfortunately, 
the problem of ambiguity arises at the end of this proce- 
dure. From previous work in computational mechanics, 
however, we expect uniqueness of minimal generators up 
to isomorphism. 